## Loading required package: grid
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Let’s begin by listing the variables in the dataset.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
Next, I need more detail about the types of the variables in the dataset.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
The dataset has 1599 observations of 13 variables. All the variables are of type num except X and quality, which are of type int.
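The inspection steps above can be sketched in base R roughly as follows. This is a minimal sketch, not the original code: it rebuilds only the first 10 rows of a few columns from the `str()` output shown above, and the file name in the comment is an assumption.

```r
# First 10 rows of four columns, copied from the str() output above.
# With the full data one would instead load the file, e.g.
# wine <- read.csv("wineQualityReds.csv")   # hypothetical file name
wine <- data.frame(
  X             = 1:10,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

names(wine)     # column names
str(wine)       # dimensions and the type of every column
summary(wine)   # min, quartiles, median, mean and max per column

# Storage type of each column: integer for X and quality, double otherwise
col_types <- vapply(wine, typeof, character(1))
print(col_types)
```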
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
This summary shows that X is just an observation number (an identifier), so it has no effect on the quality of the red wine and we can ignore it.
Quality is an ordered, discrete variable.
75% of the red wines have a quality of 6 or lower.
The other variables are continuous.
The median fixed.acidity is 7.90, the maximum volatile.acidity is 1.58, and the median pH is 3.31.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Quality appears to be roughly normally distributed over its discrete values (3, 4, 5, 6, 7 and 8).
Roughly 680 wines have quality 5 and roughly 640 have quality 6, followed by 7, 4, 8, and finally 3 with the fewest wines.
We can group quality into 3 categories (bad, fair and good) by creating a new categorical variable called quality_rating.
## bad fair good
## 63 1319 217
Here we have 63 bad wines, 1319 fair wines and 217 good wines.
Fair is by far the most common rating.
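One way to build quality_rating is with `cut()`. The sketch below assumes the grouping described later in this report (3 and 4 are bad, 5 and 6 are fair, 7 and 8 are good) and uses only the first 10 quality values from the `str()` output, so its counts differ from the full-dataset table above.

```r
# First 10 quality scores, copied from the str() output earlier in the report
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Intervals are (2,4], (4,6], (6,8]: bad = 3-4, fair = 5-6, good = 7-8
quality_rating <- cut(
  quality,
  breaks = c(2, 4, 6, 8),
  labels = c("bad", "fair", "good"),
  ordered_result = TRUE
)

rating_counts <- table(quality_rating)
print(rating_counts)
```

Making the factor ordered keeps bad < fair < good, which is convenient when the rating is later used on a plot axis or as a facet.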
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The last plot shows that volatile.acidity has a long-tailed distribution.
It is hard to tell from it whether volatile.acidity is normally distributed.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
fixed.acidity has a long-tailed distribution.
On a log10 scale, fixed.acidity appears to be normally distributed.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
##
## FALSE TRUE
## 1467 132
As we can see, most wines have citric.acid between 0 and 0.5, and 132 red wines have a citric.acid value of 0.
The last plot shows that citric.acid is not normally distributed.
Now I’ll create a new variable that represents the total fixed acids of a wine (fixed.acidity + citric.acid). Let’s call it total.fixed.acids.
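The FALSE/TRUE table above is the kind of output produced by tabulating a logical test. A minimal sketch, using only the first 10 citric.acid values from the `str()` output (so the counts are not the 1467/132 of the full dataset):

```r
# First 10 citric.acid values, copied from the str() output earlier in the report
citric.acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Tabulate which wines have exactly zero citric acid
zero_table <- table(citric.acid == 0)
print(zero_table)

# The same count as a single number
n_zero <- sum(citric.acid == 0)
print(n_zero)
```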
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.750 7.240 8.170 8.591 9.560 16.550
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
total.fixed.acids has a long-tailed distribution.
Plotting total.fixed.acids on a log10 scale reveals that it is approximately normally distributed.
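The derived variable and the log10 view can be sketched as below. This uses base R and the first 10 rows from the `str()` output rather than the original plotting code. The `stat_bin` messages in this report suggest the plots were drawn with ggplot2, where `scale_x_log10()` and an explicit `binwidth` would play the same role.

```r
# First 10 rows, copied from the str() output earlier in the report
fixed.acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
citric.acid   <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# New variable: sum of the two fixed acids
total.fixed.acids <- fixed.acidity + citric.acid
print(summary(total.fixed.acids))

# Bin the log10-transformed values; plot = FALSE returns the bins without drawing
log_hist <- hist(log10(total.fixed.acids), breaks = 5, plot = FALSE)
print(log_hist$counts)
```

Setting the number of breaks (or a bin width) explicitly also avoids relying on the default of range/30 that the messages above warn about.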
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The last plot shows that most wines have a residual.sugar value between 1.5 and 3. It also shows that residual.sugar has a heavy-tailed distribution with many outliers.
Even on a log10 scale, residual.sugar still shows this heavy-tailed distribution.
I thought of creating a new variable classifying red wines into 2 categories (sweet and non-sweet), but a wine is considered sweet only if its residual.sugar is at least 45, and the maximum value in the dataset is 15.5, so no wine would be classified as sweet.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The last plot shows that most wines have a chlorides value between 0.03 and 0.13. Like residual.sugar, chlorides has a heavy-tailed distribution with many outliers.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
free.sulfur.dioxide has a long-tailed distribution.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
total.sulfur.dioxide has a long-tailed distribution, like free.sulfur.dioxide.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
## Warning: position_stack requires constant width: output may be incorrect
Density is clearly approximately normally distributed.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
Like density, pH has a normal distribution with a few outliers.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
sulphates has a long-tailed distribution.
On a log10 scale, sulphates appears to be normally distributed.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
alcohol has a long-tailed distribution.
## Warning: position_stack requires constant width: output may be incorrect
Alcohol still has a non-normal distribution with a long tail.
The dataset contains 1599 red wines with 13 variables (X, fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality). All variables are num except X and quality, which are int.
Other observations: quality is a discrete variable while all the others are continuous.
Quality is roughly normally distributed.
volatile.acidity, fixed.acidity and sulphates appear to be normally distributed when plotted on a log10 scale.
chlorides and residual.sugar have heavy-tailed distributions with a lot of outliers.
free.sulfur.dioxide and total.sulfur.dioxide have long-tailed distributions.
density and pH are normally distributed.
75% of the red wines have a quality of 6 or lower.
Many wines have 0 citric.acid.
The minimum quality is 3 and the maximum quality is 8.
The median fixed.acidity is 7.90.
The maximum volatile.acidity is 1.58.
The median pH is 3.31.
The X variable is just an identifier of the observations.
I am most interested in the quality of red wine, and I want to explore the variables that affect it.
From googling and from the variable descriptions, I think the variables below will support my investigation into quality:
1- Acids [Fixed, Volatile and citric]
2- alcohol
3- pH
4- total sulfur dioxide
1- quality_rating: a categorical version of the quality variable
2- total.fixed.acids: the sum of the fixed and citric acids in the wine
Besides normal distributions, there were long-tailed and heavy-tailed distributions. All I did with these data was set the binwidth and transform the scale to get better visualizations.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.256130895 0.67170343
## volatile.acidity -0.25613089 1.000000000 -0.55249568
## citric.acid 0.67170343 -0.552495685 1.00000000
## residual.sugar 0.11477672 0.001917882 0.14357716
## chlorides 0.09370519 0.061297772 0.20382291
## free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813
## total.sulfur.dioxide -0.11318144 0.076470005 0.03553302
## density 0.66804729 0.022026232 0.36494718
## pH -0.68297819 0.234937294 -0.54190414
## sulphates 0.18300566 -0.260986685 0.31277004
## alcohol -0.06166827 -0.202288027 0.10990325
## quality 0.12405165 -0.390557780 0.22637251
## total.fixed.acids 0.99704157 -0.294847154 0.72665884
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.114776724 0.093705186 -0.153794193
## volatile.acidity 0.001917882 0.061297772 -0.010503827
## citric.acid 0.143577162 0.203822914 -0.060978129
## residual.sugar 1.000000000 0.055609535 0.187048995
## chlorides 0.055609535 1.000000000 0.005562147
## free.sulfur.dioxide 0.187048995 0.005562147 1.000000000
## total.sulfur.dioxide 0.203027882 0.047400468 0.667666450
## density 0.355283371 0.200632327 -0.021945831
## pH -0.085652422 -0.265026131 0.070377499
## sulphates 0.005527121 0.371260481 0.051657572
## alcohol 0.042075437 -0.221140545 -0.069408354
## quality 0.013731637 -0.128906560 -0.050656057
## total.fixed.acids 0.121334969 0.108045144 -0.148947646
## total.sulfur.dioxide density pH
## fixed.acidity -0.11318144 0.66804729 -0.68297819
## volatile.acidity 0.07647000 0.02202623 0.23493729
## citric.acid 0.03553302 0.36494718 -0.54190414
## residual.sugar 0.20302788 0.35528337 -0.08565242
## chlorides 0.04740047 0.20063233 -0.26502613
## free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
## total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
## density 0.07126948 1.00000000 -0.34169933
## pH -0.06649456 -0.34169933 1.00000000
## sulphates 0.04294684 0.14850641 -0.19664760
## alcohol -0.20565394 -0.49617977 0.20563251
## quality -0.18510029 -0.17491923 -0.05773139
## total.fixed.acids -0.10127190 0.65737801 -0.68958445
## sulphates alcohol quality
## fixed.acidity 0.183005664 -0.06166827 0.12405165
## volatile.acidity -0.260986685 -0.20228803 -0.39055778
## citric.acid 0.312770044 0.10990325 0.22637251
## residual.sugar 0.005527121 0.04207544 0.01373164
## chlorides 0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide 0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide 0.042946836 -0.20565394 -0.18510029
## density 0.148506412 -0.49617977 -0.17491923
## pH -0.196647602 0.20563251 -0.05773139
## sulphates 1.000000000 0.09359475 0.25139708
## alcohol 0.093594750 1.00000000 0.47616632
## quality 0.251397079 0.47616632 1.00000000
## total.fixed.acids 0.202161691 -0.04578490 0.13852654
## total.fixed.acids
## fixed.acidity 0.9970416
## volatile.acidity -0.2948472
## citric.acid 0.7266588
## residual.sugar 0.1213350
## chlorides 0.1080451
## free.sulfur.dioxide -0.1489476
## total.sulfur.dioxide -0.1012719
## density 0.6573780
## pH -0.6895844
## sulphates 0.2021617
## alcohol -0.0457849
## quality 0.1385265
## total.fixed.acids 1.0000000
We can see that quality has a moderate positive correlation with alcohol (0.476) and a moderate negative correlation with volatile.acidity (-0.391).
pH is negatively correlated with both fixed.acidity (-0.683) and citric.acid (-0.542), which is consistent with the description of pH.
Also note that free.sulfur.dioxide correlates with total.sulfur.dioxide (0.668), which makes sense because free.sulfur.dioxide is a subset of total.sulfur.dioxide.
Finally, total.fixed.acids is correlated with fixed.acidity (0.997) and citric.acid (0.727), which is expected because total.fixed.acids is the sum of these two acids.
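A correlation matrix like the one above can be produced with `cor()`. The sketch below is only illustrative: it uses the first 10 rows of a few columns from the `str()` output, so its coefficients will not match the full-dataset values quoted above.

```r
# First 10 rows of selected columns, copied from the str() output earlier in the report
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
wine$total.fixed.acids <- wine$fixed.acidity + wine$citric.acid

# Pearson correlation between every pair of columns (the default method)
corr <- cor(wine)
print(round(corr, 3))

# A single pair can be read off by name
print(corr["fixed.acidity", "total.fixed.acids"])
```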
This plot shows that high quality wines tend to have high alcohol values. The vertical strips reflect the fact that quality is a discrete variable taking one of the values 3, 4, 5, 6, 7 and 8.
High quality wines have low volatile.acidity, which matches the negative effect of high volatile.acidity levels on quality noted in the variable description.
This plot shows that low quality wines (3 and 4) have a maximum fixed.acidity of around 12, while the higher qualities reach higher fixed.acidity values. This suggests that high quality wines tend to have higher fixed.acidity.
This plot shows that high quality wines have high citric.acid values.
The plot doesn’t show any impact of residual.sugar on quality.
This plot shows that most high quality wines (7 and 8) have chlorides values around 0.1, while the lower qualities reach higher chlorides values. This suggests that high quality wines have low chlorides.
The plot shows that free.sulfur.dioxide has no impact on quality.
The plot reveals that for many of the highest quality wines total.sulfur.dioxide does not exceed 100, while for other wines it does.
So there is only a faint indication that high quality wines have low total.sulfur.dioxide.
Low quality wines have high density while high quality wines have lower density, so we can say that high quality wines have low density.
It seems, roughly, that high quality wines have low pH.
High quality wines have higher sulphates values than low quality wines.
High quality wines have high total.fixed.acids values.
The plot shows an exponential relationship between total.fixed.acids and pH.
The plot reveals a linear relationship between total.fixed.acids and fixed.acidity.
Quality correlates moderately with alcohol and volatile.acidity: high quality wines have high alcohol and low volatile.acidity.
Quality has a low positive correlation with fixed.acidity, citric.acid, sulphates and total.fixed.acids, and a low negative correlation with density, total.sulfur.dioxide and chlorides.
Quality has only a very weak correlation with pH, residual.sugar and free.sulfur.dioxide.
Yes, I observed the relationship between total.fixed.acids and pH, which seems to be exponential with a strong negative correlation. This is logical, as pH is a measure of acidity.
The relationship between fixed.acidity and total.fixed.acids was also observed to be linear.
The strongest relationship is the one between fixed.acidity and total.fixed.acids.
In this section I’ll explore the most interesting variables that may affect quality, in conjunction with the quality and quality_rating variables.
This plot shows a weak negative correlation between volatile.acidity and alcohol. We can see that good wines have high alcohol and low volatile.acidity.
There is a strong relationship between fixed.acidity and citric.acid. It is also clear that good wines have high values of both fixed.acidity and citric.acid.
The strong relationship between free.sulfur.dioxide and total.sulfur.dioxide is clear. We can also see that free.sulfur.dioxide has almost no effect on quality: all quality levels cover almost the same range of free.sulfur.dioxide values. For total.sulfur.dioxide, there is only a faint indication that high quality wines have high values.
This plot shows the relationships between pH and both fixed.acidity and total.fixed.acids, which seem to be strong.
As fixed.acidity or total.fixed.acids increases, pH decreases. It is also clear that good wines have high values of fixed.acidity and total.fixed.acids, and low values of pH.
There is a weak correlation between alcohol and pH. Good wines have higher alcohol and lower pH values than bad and fair wines.
This plot shows no relationship between alcohol and total.fixed.acids, but it reveals that good wines have high values of both.
By faceting the plots by quality rating, I can visualize the relationships between many variables and their impact on quality.
Starting with alcohol, which has the highest correlation with quality, I notice that as alcohol increases, volatile.acidity decreases and wine quality increases. This makes sense, as we know from the variable description that good wines have low volatile.acidity.
There is no relationship between alcohol and total.fixed.acids, but both variables correlate with quality: good wines have high values of alcohol and total.fixed.acids.
fixed.acidity and citric.acid are correlated with each other and have a small impact on quality. Good wines have high values of both.
free.sulfur.dioxide and total.sulfur.dioxide are strongly correlated. We can see no impact of free.sulfur.dioxide on quality, while total.sulfur.dioxide appears to have a slight positive impact.
Finally, pH seems to be strongly correlated with both fixed.acidity and total.fixed.acids, which makes sense because pH is a measure of acidity and fixed.acidity is a component of total.fixed.acids. It also has a weak correlation with alcohol.
From googling pH I learned that it is considered almost the backbone of wine quality, but surprisingly I found almost no relationship between quality and pH. The correlation is extremely weak.
No
This plot shows wine quality in 2 forms. The first is the original quality variable, which takes ordered discrete values from 3 to 8.
The second is the variable I created, quality_rating, which groups the quality values into 3 categories [bad, fair, and good].
We can see that the most common rating in the dataset is fair [5 and 6], then good [7 and 8], and the least common is bad [3 and 4].
This plot shows the impact of volatile.acidity and pH on quality. We can clearly see that good wines have low values of both volatile.acidity and pH.
This plot shows the impact of alcohol on quality rating. The median alcohol value for bad and fair wines is about 10, while good wines have a median alcohol value of around 11. Good wines have high alcohol values.
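The per-rating medians read off the plot can also be computed directly. The sketch below uses only the first 10 rows from the `str()` output, so it contains no bad wines and its medians differ from the full-dataset values quoted above. The rating cut points follow the [3 and 4], [5 and 6], [7 and 8] grouping described above.

```r
# First 10 rows, copied from the str() output earlier in the report
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

quality_rating <- cut(
  quality,
  breaks = c(2, 4, 6, 8),
  labels = c("bad", "fair", "good"),
  ordered_result = TRUE
)

# Median alcohol within each rating; empty groups come back as NA
median_alcohol <- tapply(alcohol, quality_rating, median)
print(median_alcohol)

# boxplot(alcohol ~ quality_rating) would draw the corresponding box plot
```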
This dataset contains 1599 observations of 13 variables (15 including the two I created), among them the quality variable, which is my feature of interest. I began by exploring all the variables individually.
After that I created a new categorical variable, ‘quality_rating’, which expresses quality in meaningful terms rather than as numbers. I also combined the fixed acid variables (fixed.acidity and citric.acid) into one variable, total.fixed.acids.
I explored quality against all the other variables to understand their impact on it.
I was able to determine that the main features affecting quality are alcohol and acidity. There are also other features with a low impact, such as sulphates, density and pH.
I would be interested in creating a linear model and testing its accuracy.
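As a starting point for that future work, a linear model could be fitted with `lm()`. The sketch below is only an assumption about what such a model might look like: the choice of alcohol and volatile.acidity as predictors follows the correlations found above, and the 10 rows copied from the `str()` output are far too few to say anything about real accuracy.

```r
# First 10 rows, copied from the str() output earlier in the report
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Ordinary least squares: quality explained by alcohol and volatile.acidity
fit <- lm(quality ~ alcohol + volatile.acidity, data = wine)
print(coef(fit))

# Share of variance in quality explained by the model on this sample
r2 <- summary(fit)$r.squared
print(r2)
```

On the full dataset, accuracy would be better judged on held-out data than on the rows used for fitting.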